---
title: Golden Datasets
description: Use curated datasets to prevent regressions and ensure consistent prompt performance.
---

A "Golden Dataset" is a carefully curated collection of inputs and expected outputs that represents critical test cases and desired behaviors for your prompt. It serves as a benchmark to prevent regressions when making changes.

## Why Use a Golden Dataset?

- **Prevent Regressions**: Ensure that changes to your prompt (or underlying models) don't break previously working functionality or degrade quality on important cases.
- **Consistent Benchmarking**: Provide a stable baseline for comparing the performance of different prompt versions.
- **Confidence in Deployment**: Increase confidence that a new prompt version meets quality standards before publishing.
- **Capture Edge Cases**: Explicitly test how your prompt handles known difficult or important scenarios.

## Creating a Golden Dataset

1.  **Identify Critical Scenarios**: Determine the most important inputs or use cases your prompt must handle correctly.
2.  **Gather Examples**: Collect representative examples for these scenarios. Sources include:
    - Real production [Logs](/guides/logs/overview) (especially successful ones or interesting failures).
    - Manually crafted edge cases.
    - Existing test suites.
3.  **Define Expected Outputs (Ground Truth)**: For each input, define the ideal or minimally acceptable output. This might be:
    - An exact string.
    - A specific JSON structure.
    - Key information that must be present.
    - A classification label.
4.  **Format as CSV**: Structure this data into a CSV file with appropriate input columns (matching prompt parameters) and output columns (e.g., `expected_output`, `expected_category`).
5.  **Upload to Latitude**: [Upload the CSV as a new Dataset](/guides/datasets/overview#1-uploading-csv-files) in Latitude and give it a clear name (e.g., "Chatbot v2 - Golden Regression Set").
6.  **Marking the expected output column**: You can mark the expected output column as a 'label' by clicking on the column name and editing its role.

## Using the Golden Dataset in Workflows

- **During Development**: When iterating on a prompt in a draft version, run batch evaluations using relevant [Programmatic Rules](/guides/evaluations/programmatic-rules) (like Exact Match, Semantic Similarity, JSON Validation) against the golden dataset to check for regressions before considering the draft ready.
- **CI/CD Pipeline**: Integrate automated batch evaluations against the golden dataset into your pre-deployment checks. Fail the build if key metrics on the golden dataset drop below a threshold.
- **Version Comparison**: When comparing two prompt versions (e.g., A/B testing), run both against the golden dataset using the same evaluations to get a standardized performance comparison.

## Maintaining the Golden Dataset

- **Review Periodically**: Regularly review the golden dataset to ensure it still represents the most critical scenarios.
- **Add New Cases**: As new important use cases or failure modes are discovered in production, consider adding them to the golden dataset.
- **Version Control (Implicit)**: While datasets themselves aren't directly versioned _within_ Latitude like prompts, you can manage your source CSV files in your own version control system (like Git) if needed.

By establishing and maintaining golden datasets, you create a robust safety net for your prompt development lifecycle.

## Next Steps

- Learn more about [Creating and Using Datasets](/guides/datasets/overview)
- Set up [Programmatic Rule Evaluations](/guides/evaluations/programmatic-rules) to use with your dataset.
- Integrate checks into your [Team Workflows](/guides/evaluations/integrating-evaluations-workflow).
